15 research outputs found

    HIV drug resistance prediction with weighted categorical kernel functions

    Background: Antiretroviral drugs are a very effective therapy against HIV infection. However, the high mutation rate of HIV permits the emergence of variants that can be resistant to the drug treatment. Predicting the drug resistance of previously unobserved variants is therefore very important for optimal medical treatment. In this paper, we propose the use of weighted categorical kernel functions to predict drug resistance from virus sequence data. These kernel functions are very simple to implement and are able to take into account HIV data particularities, such as allele mixtures, and to weight each protein residue by its importance, as it is known that not all positions contribute equally to the resistance. Results: We analyzed 21 drugs of four classes: protease inhibitors (PI), integrase inhibitors (INI), nucleoside reverse transcriptase inhibitors (NRTI) and non-nucleoside reverse transcriptase inhibitors (NNRTI). We compared two categorical kernel functions, Overlap and Jaccard, against two well-known noncategorical kernel functions (Linear and RBF) and Random Forest (RF). Weighted versions of these kernels were also considered, where the weights were obtained from the RF decrease in node impurity. The Jaccard kernel was the best method, either in its weighted or unweighted form, for 20 out of the 21 drugs. Conclusions: Results show that kernels that take into account both the categorical nature of the data and the presence of mixtures consistently result in the best prediction model. The advantage of including weights depended on the protein targeted by the drug. In the case of reverse transcriptase, weights based on the relative importance of each position clearly increased the prediction performance, while the improvement in the protease was much smaller. This seems to be related to the distribution of weights, as measured by the Gini index.
All methods described, together with documentation and examples, are freely available at https://bitbucket.org/elies_ramon/catkern. Peer Reviewed. Postprint (published version)
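
The abstract above describes categorical kernels that handle allele mixtures and per-position weights. A minimal sketch of that idea follows, assuming each sequence position is stored as a set of alleles (a mixture is a set with more than one element) and that the kernel is a plain weighted average of per-position similarities. The function names, the toy sequences and the averaging scheme are illustrative assumptions; the published kernels may use a different normalization or wrap the similarity in another function.

```python
from typing import Callable, Optional, Sequence, Set

Residue = Set[str]  # alleles observed at one position; a mixture has several


def overlap_position(a: Residue, b: Residue) -> float:
    """1.0 when both positions carry exactly the same allele set, else 0.0."""
    return 1.0 if a == b else 0.0


def jaccard_position(a: Residue, b: Residue) -> float:
    """Shared alleles divided by all alleles seen at this position."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def weighted_kernel(
    x: Sequence[Residue],
    y: Sequence[Residue],
    position_sim: Callable[[Residue, Residue], float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted average of per-position similarities (assumed formulation)."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    w = [1.0] * len(x) if weights is None else list(weights)
    if len(w) != len(x) or any(v < 0 for v in w) or sum(w) == 0:
        raise ValueError("weights must be non-negative, one per position, not all zero")
    return sum(wi * position_sim(a, b) for wi, a, b in zip(w, x, y)) / sum(w)


# Toy three-position sequences; the second position of seq_a is a mixture.
seq_a = [{"L"}, {"M", "V"}, {"K"}]
seq_b = [{"L"}, {"M"}, {"R"}]
print(weighted_kernel(seq_a, seq_b, overlap_position))
print(weighted_kernel(seq_a, seq_b, jaccard_position))
print(weighted_kernel(seq_a, seq_b, jaccard_position, weights=[1.0, 3.0, 0.0]))
```

In this toy case Overlap treats the mixture position as a mismatch, while Jaccard gives it partial credit for the shared allele. In the paper the weights come from the RF decrease in node impurity; here they are simply hypothetical numbers passed in by hand.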

    kernInt : A Kernel Framework for Integrating Supervised and Unsupervised Analyses in Spatio-Temporal Metagenomic Datasets

    The advent of next-generation sequencing technologies allowed relative quantification of microbiome communities and their spatial and temporal variation. In recent years, supervised learning (i.e., prediction of a phenotype of interest) from taxonomic abundances has become increasingly common in the microbiome field. However, a gap exists between supervised and classical unsupervised analyses, based on computing ecological dissimilarities for visualization or clustering. Despite this, both approaches face common challenges, like the compositional nature of next-generation sequencing data or the integration of the spatial and temporal dimensions. Here we propose a kernel framework to place on a common ground the unsupervised and supervised microbiome analyses, including the retrieval of microbial signatures (taxa importances). We define two compositional kernels (Aitchison-RBF and compositional linear) and discuss how to transform non-compositional beta-dissimilarity measures into kernels. Spatial data is integrated with multiple kernel learning, while longitudinal data is evaluated by specific kernels. We illustrate our framework through a single point soil dataset, a human dataset with a spatial component, and a previously unpublished longitudinal dataset concerning pig production. The proposed framework and the case studies are freely available in the kernInt package at https://github.com/elies-ramon/kernInt
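
The two compositional kernels named above can be sketched as follows, assuming they are built on the centered log-ratio (clr) transform: the Aitchison distance is the Euclidean distance between clr-transformed compositions, so an RBF kernel on that distance and a dot product of clr vectors are natural candidates. The function names, the `gamma` parameter and the pseudocount used to avoid log(0) are assumptions for illustration and are not taken from the kernInt package.

```python
import math
from typing import List, Sequence


def clr(counts: Sequence[float], pseudocount: float = 1.0) -> List[float]:
    """Centered log-ratio transform; pseudocount guards against zero counts."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def compositional_linear(
    x: Sequence[float], y: Sequence[float], pseudocount: float = 1.0
) -> float:
    """Dot product of the clr-transformed abundance vectors."""
    cx, cy = clr(x, pseudocount), clr(y, pseudocount)
    return sum(a * b for a, b in zip(cx, cy))


def aitchison_rbf(
    x: Sequence[float],
    y: Sequence[float],
    gamma: float = 0.1,
    pseudocount: float = 1.0,
) -> float:
    """RBF kernel on the squared Aitchison distance between two samples."""
    cx, cy = clr(x, pseudocount), clr(y, pseudocount)
    squared_distance = sum((a - b) ** 2 for a, b in zip(cx, cy))
    return math.exp(-gamma * squared_distance)


# Hypothetical taxon counts for two samples over the same four taxa.
sample_1 = [120, 30, 0, 50]
sample_2 = [90, 45, 5, 60]
print(compositional_linear(sample_1, sample_2))
print(aitchison_rbf(sample_1, sample_2))
```

With strictly positive counts and `pseudocount=0.0`, rescaling a sample (for example, a different sequencing depth) leaves its clr vector unchanged, which is the property that makes these kernels suitable for compositional data. A non-zero pseudocount breaks that invariance slightly, so the choice is a trade-off rather than a fixed rule.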

    Eradication of common pathogens at days 2, 3 and 4 of moxifloxacin therapy in patients with acute bacterial sinusitis

    BACKGROUND: Acute bacterial sinusitis (ABS) is a common infection in clinical practice. Data on time to bacteriologic eradication after antimicrobial therapy are lacking for most agents, but are necessary in order to optimize therapy. This was a prospective, single-arm, open-label, multicenter study to determine the time to bacteriologic eradication in ABS patients (maxillary sinusitis) treated with moxifloxacin. METHODS: Adult patients with radiologically and clinically confirmed ABS received once-daily moxifloxacin 400 mg for 10 days. Middle meatus secretion sampling was performed using nasal endoscopy pre-therapy, and repeated on 3 consecutive days during treatment. Target enrollment was 30 bacteriologically evaluable patients (pre-therapy culture positive for Streptococcus pneumoniae, Haemophilus influenzae or Moraxella catarrhalis and evaluable cultures for at least Day 2 and Day 3 during therapy visits), including at least 10 each with S. pneumoniae or H. influenzae. RESULTS: Of 192 patients enrolled, 42 were bacteriologically evaluable, with 48 pathogens isolated. Moxifloxacin was started on Day 1. Baseline bacteria were eradicated in 35/42 (83.3%) patients by day 2, 42/42 (100%) patients by day 3, and 41/42 (97.6%) patients by day 4. In terms of individual pathogens, 12/18 S. pneumoniae, 22/23 H. influenzae and 7/7 M. catarrhalis were eradicated by day 2 (total 41/48; 85.4%), and 18/18 S. pneumoniae and 23/23 H. influenzae were eradicated by day 3. On Day 4, S. pneumoniae was isolated from a patient who had negative cultures on Days 2 and 3. Thus, the Day 4 eradication rate was 47/48 (97.9%). Clinical success was achieved in 36/38 (94.7%) patients at the test of cure visit. CONCLUSION: In patients with ABS (maxillary sinusitis), moxifloxacin 400 mg once daily for 10 days resulted in eradication of baseline bacteria in 83.3% of patients by Day 2, 100% by Day 3 and 97.6% by Day 4

    Kernel approaches for complex phenotype prediction

    The relationship between phenotype and genotypic information is intricate and complex. Machine Learning (ML) methods have been successfully used for phenotype prediction in a wide range of problems within genetics and genomics. However, biological data are usually structured and belong to "nonstandard" data types, which can pose a challenge to most ML methods. Among them, kernel methods offer a very versatile approach to handling different types of data and problems through a family of functions called kernels. The main goal of this PhD thesis is the development and evaluation of specific kernel approaches for phenotypic prediction, focusing on biological problems with structured data types or study designs.
In the first part, we predict drug resistance from HIV-mutated protein sequences (protease, reverse transcriptase and integrase). We propose two categorical kernel functions (Overlap and Jaccard) that take into account HIV data particularities, such as allele mixtures. The proposed kernels are coupled with Support Vector Machines (SVM) and compared against two well-known standard kernel functions (Linear and RBF) and two nonkernel methods: Random Forests (RF) and the Multilayer Perceptron neural network. We also include in these kernels a relative weight representing the importance of each protein residue for drug resistance. Taking into account both the categorical nature of the data and the presence of mixtures consistently delivers better predictions. The weighting effect is larger for reverse transcriptase and integrase inhibitors, which may be related to the different mutational patterns by which the three viral enzymes acquire drug resistance.
In the second part, we extend the previous study to consider the fact that protein positions are not independent. Mutated sequences are modeled as graphs, with edges weighted by the Euclidean distance between residues, obtained from three-dimensional crystal structures (X-ray crystallography). A kernel for graphs (the exponential random walk kernel) that integrates the previous Overlap and Jaccard kernels is then computed. Despite the potential advantages of this graph kernel, we do not observe an improvement in predictive ability over the kernels of the first study.
In the third part, we propose a kernel framework to unify unsupervised and supervised microbiome analyses. To do so, we use the same kernel matrix to perform phenotype prediction via SVMs and visualization via kernel Principal Components Analysis (kPCA). We define two kernels for compositional data (Aitchison-RBF and compositional linear) and discuss the transformation of beta-diversity measures into kernels. The compositional linear kernel also allows the retrieval of taxa importances (microbial signatures) from the SVM model. Spatial and time-structured datasets are handled with Multiple Kernel Learning and kernels for time series, respectively. We illustrate the kernel framework with three datasets: a single point soil dataset, a human dataset with a spatial component, and a previously unpublished longitudinal dataset concerning pig production. Analyses across the three case studies include a comparison with the original reports (for the first two datasets), as well as a contrast with results from RF. The kernel framework not only allows a holistic view of the data but also gives good results in each learning area. In unsupervised analyses, the main patterns detected in the original reports are conserved in kPCA. In supervised analyses, SVM has better (or, in some cases, equivalent) performance than RF, while microbial signatures are consistent with the original studies and previous literature.
Universitat Autònoma de Barcelona. Programa de Doctorat en Genètica
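
The thesis abstract mentions turning beta-diversity measures into kernels so that one kernel matrix can serve both SVM and kPCA. One standard way to do this, shown here as a sketch and not necessarily the transformation used in the thesis, is double centering of the squared dissimilarity matrix (the same step used in classical multidimensional scaling). The function name and the toy matrix are illustrative assumptions.

```python
from typing import List, Sequence


def distance_to_kernel(dist: Sequence[Sequence[float]]) -> List[List[float]]:
    """Double-center squared dissimilarities: K = -1/2 * J * D^2 * J.

    The result is positive semidefinite only when the dissimilarity is
    Euclidean-embeddable; other measures can give negative eigenvalues.
    """
    n = len(dist)
    if any(len(row) != n for row in dist):
        raise ValueError("dissimilarity matrix must be square")
    squared = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in squared]
    col_mean = [sum(squared[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [
            -0.5 * (squared[i][j] - row_mean[i] - col_mean[j] + grand_mean)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy example: three samples lying on a line at positions 0, 1 and 3.
toy_distances = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
for row in distance_to_kernel(toy_distances):
    print(row)
```

For a Euclidean distance the output equals the matrix of inner products between mean-centered samples, so it can be passed to any method that accepts a precomputed kernel. For non-Euclidean dissimilarities such as Bray-Curtis, a correction for negative eigenvalues may be needed before using the matrix as a kernel.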

    Kernel functions for HIV drug resistance prediction

    Work presented at the CRAG Seminar (Internal seminar), held on 7 February 2020. Peer reviewed
